ABSTRACT



The basic objective of the study was to determine the validity 
of four new indices of item quality. Three of these were based on 
analyses of differential, empirical weights for item choices, and 
the fourth was designed to measure the relative attractiveness of 
distracters. A secondary objective was to ascertain the validity 
of the conventional discrimination index. 

To attain these objectives, multiple-choice items designed to 
vary in quality with respect to nine common item-writing principles 
were prepared. The quality of each item was rated independently by 
three judges, and the average of their ratings was used as the 
criterion to determine the validity of the indices. 

The special test items were administered to a sample of college 
undergraduates, and the five indices were computed on the basis of 
their responses. 

The data were analyzed, and the conventional discrimination 
index was found to be a moderately valid measure of item quality. 

The weighted combination of the new indices also appeared to be 
valid. Because not all of the new indices operated in the way
expected, however, it is suggested that further research on them
is necessary before they are considered for practical use in
test-construction projects.
CHAPTER 1



MEASUREMENT OF ITEM QUALITY 



The Problem

To write multiple-choice items of high quality for aptitude 
and achievement tests requires a thorough knowledge of the subject 
matter and skills that are to be measured, highly developed writing 
skills, ingenuity in conceiving and casting testable ideas into 
proper form, and psychological insight into the probable reactions 
of different groups of examinees to the items. Because item writing 
is so complicated a skill, and because the component characteristics 
of items of high quality have never been adequately defined, satis- 
factory measurement of item quality has been, at best, difficult 
to achieve.



The Background of the Problem 

In the past, attempts to measure item quality have made use 
of subjective judgments, indices of item difficulty, and indices of 
item-choice correlation with a criterion variable (usually total 
score on the test in which the item is included). Commonly, only
the correlation coefficient between the dichotomy of marking or not 
marking the keyed choice and the criterion variable has been com- 
puted. Subjective judgments have been less than satisfactory, 
partly because a clear indication of important points to be
considered has not been available to the judges and partly because 
of the inherent unreliability of judgments of the type involved. 
Conventional item-analysis data, on the other hand, sometimes are 
helpful in detecting defective items and aid in the selection and 
revision of items for inclusion in the final version of a test. 
Despite these attempts to measure item quality, inspection of 
achievement and aptitude tests indicates that a relatively large 
number of faulty items are not identified during test construction. 
Consequently, it seems desirable to investigate systematically the 
validity of conventional item-analysis data for measuring item 
quality as well as to examine the effectiveness of some new methods 
for measuring item quality that have received little attention in 
the past. 



New Measures of Item Quality 

Three new indices, suggested by Davis (1959), incorporate
information provided by conventional item-analysis data with
information on the choice-criterion coefficients for unkeyed
choices in a manner designed to make the resulting indices
particularly sensitive to specific aspects of item quality. These three
indices, plus a fourth devised for use in this study, were determined
for each of 54 specially prepared items in such a way that their
usefulness in judging item quality could be estimated and compared
directly with the usefulness of conventional item-analysis data.

Three sets of judgments of the quality of the 54 items were used as 
criteria of item quality. These judgments were made with the aid 
of a guide list of critical points to be considered in evaluating 
the quality of multiple-choice items. 



The Purpose of the Study

The basic problem for study is, then, the measurement of the 
quality of multiple-choice test items. Specifically, the effective- 
ness of the best-weighted combination of the new indices is compared 
with that of conventional item-analysis data. 
CHAPTER 2



FAULTS IN MULTIPLE-CHOICE ITEMS: 
A SURVEY OF THE LITERATURE 



The Emphasis upon Avoiding Faults 
in Multiple-Choice Items



Wesman (1971) suggests that during the last two decades the 
emphasis on item-writing principles in textbooks on educational 
measurement has increased greatly. The current emphasis on this 
topic is revealed by a recent survey of the literature by Masonis
(1971), which resulted in a list of forty-seven principles for
writing multiple-choice items. Violation of most of these principles
leads, logically, to the construction of faulty items (i.e.,
items of low quality). It is interesting to note that the list
contains several contradictions that result from disagreements
among item-writing experts regarding principles. However, there
appears to be widespread agreement among experts on many of these 
principles. For example, thirty-four writers suggest that "All 
options should be plausible for the uninformed student" (Masonis, 
1971, p. 93). In the study reported here, special items were 
written that vary with respect to nine item-writing principles. 
Eight of these principles appear in the list compiled by Masonis,
and six of them were suggested by nine or more writers. The 
principle that items should be unambiguous is the only principle 
used in this study that is not explicitly included in the list, 
but it is implied by several of the other principles. The effects
of following these widely recommended principles and thus avoiding
certain faults, however, have received relatively little attention
in the literature.



The Effects of Faults on Scores 
on Multiple-Choice Tests 

Some of the research on test-wiseness provides data on the
extent to which examinees use certain kinds of faults to advantage
in determining their responses to multiple-choice items. One
approach that has been used to measure the variation due to the
advantageous use of faults involves a comparison of the total scores
obtained by groups of examinees on sets of items that are designed
to measure the same points but that vary with respect to their
quality. Millman and Setijadi (1966) used this approach to determine
the extent to which test-wiseness exists in samples of American
and Indonesian students. They used multiple-choice items with
plausible distracters and multiple-choice items with implausible
distracters. In general, the latter were easier than items with
plausible distracters. Furthermore, the difference in performance
on the two types of items was greater for Americans, who were known
to have had more experience responding to multiple-choice items,
than for Indonesians. Although no tests of statistical significance
were conducted, Millman and Setijadi suggest that the type of fault
examined may have a differential effect on the performance of
examinees with varying levels of test-taking experience.

Another approach that has been used to measure test-wiseness
requires the construction of items that deal with very obscure or
fictitious material and the incorporation of certain faults that
examinees may use to raise their scores on the test above the level
that most likely would be expected to occur as a result of chance
alone. For example, some of the items used by Slakter et al. (1970b)
to measure test-wiseness included one option each that resembled the
stem of the question. The items dealt with fictitious content so
that examinees could not answer the questions on the basis of
knowledge. A test-wise examinee, in terms of these items, was defined
as one who had a tendency to select options that resemble the stems.
Significant overall differences were found among examinees in grades
five through eleven on test-wiseness items that contained four types
of faults, including the one described above. An important limitation
of studies that measure test-wiseness in the manner just
described is that the results may be appropriately generalized only
to performance on tests that are extremely difficult, which is not
typical of most tests used in educational situations.

In general, both approaches to the study of the particular
aspect of test-wiseness under consideration have indicated that an
important source of variation in test scores may be attributable to
faults that are present in test items. These studies, however,
have been concerned only with types of faults that may aid examinees in
determining the keyed choices to multiple-choice items. It should
be noted that some of the most serious faults in items make it more
difficult for an examinee to select the correct choice even when he
has a substantial amount of information about the point being tested.
For example, an ambiguity in the stem of an item may mislead and
cause a knowledgeable examinee to select an incorrect choice. In
such items, there is no response that clearly should be chosen on
the basis of the principles of test-wiseness alone. The study that
is presented in this report is concerned with the identification of
both types of faults in multiple-choice items.

Another limitation of these test-wiseness studies is that, in
a strict sense, the results appropriately may be generalized only
to items with faults that are similar in nature and degree. Wesman
(1971) suggests that the limited generalizability of item-writing
studies probably is responsible for the paucity of research in this
area. The seriousness of this limitation with respect to one
particular fault was demonstrated by Chase (1964). He found that when
responding to very difficult items in which one choice in each item
was longer than the others, examinees tended to choose the
extra-long choices only when these choices were three times as long as
the others. When these choices were only one-and-a-half to two times
as long as the other choices, the extra-long choices did not appear
to affect examinees' performance. Furthermore, the tendency to
select choices that were three times as long disappeared when
each of the difficult items with an extra-long choice was preceded
by very easy items in which the extra-long choices were clearly
incorrect. Thus, this study indicates that the widely recommended
principle that keyed choices should be no longer than the distracters
may be an important principle in terms of its effect on
examinee performance only under certain circumstances.

Despite the limitations discussed above, a sufficient number
of such studies (e.g., Chase, 1964; Millman, 1966; Slakter, 1970;
and Wahlstrom and Boersma, 1968) have identified variation in test
scores apparently attributable to certain kinds of faults in test
items to warrant the hypothesis that careful analysis of the
responses of examinees may aid in the measurement of item quality.
This hypothesis is consistent with much of the literature concerning
the uses of the conventional discrimination index, which is reviewed
in a later section.

Logically, if faults irrelevant to the points being tested
account for some of the variation in responses to test items, tests
composed of faulty items should be less valid than those composed
of faultless items. Studies of test-wiseness generally have been
concerned with the extent to which the trait exists among examinees
and with its correlates (such as sex and grade) rather than the
effects of such faults on the characteristic reliability and
validity of tests. To the best of this writer's knowledge, only
two studies have been conducted to determine the effects of various
faults on these test characteristics. Dunn and Goldstein (1959)
found that tests composed of items containing cues to the correct
choice, extra-long correct choices, and inconsistencies in grammar
between the stem and incorrect choices are less difficult than
identical tests that do not have those characteristics. The presence
or absence of these characteristics did not significantly affect the
reliability or validity of any of the tests used in their study.
Board and Whitney (1972), on the other hand, obtained somewhat
different results in an unpublished investigation of the effects of
four types of faults on test items. In general, they found that
the faults that they examined benefited poorer students more than
better students, that significantly lower reliability coefficients
were obtained as a result of three types of faults, and that
significantly lower validity coefficients occurred as a result of all
four types of faults. The differences in the studies cited above
suggest that the conditions under which faults affect test validity
and reliability are not fully understood.

In spite of the contradictory evidence regarding the effect
of item faults on test reliability and validity, certain principles
of item writing are widely recommended by test-construction experts.
The stress placed upon following these principles, in fact, may be
justified solely in terms of their effect on the public acceptance
of multiple-choice tests. For instance, items that have not been
written in accordance with established item-writing principles are
a source of concern to subject-matter specialists and scholars, such
as Hoffmann (1962). Because a great deal of the criticism of
multiple-choice test items springs from a lack of scholarly
precision in writing and editing them, it appears important to conduct
studies leading to the validation of conventional measures of item
quality and to the development of new and, it is hoped, better
measures.



Identification of Faulty Items by Means of
Conventional Item-Analysis Data 



In formal test-construction projects, drafts of test items
usually are administered to samples of examinees representative of
those with whom the items ultimately are to be used. On the basis
of examinee responses, estimates are made of each item's difficulty,
of the attractiveness of each choice, and of the ability of each
choice to discriminate among examinees of high and low ability in
the trait to be measured. Many methods for arriving at these
estimates have been proposed. The merits and deficiencies of the various
estimates as well as their relationships to each other and to overall
test characteristics have received a great deal of attention in
the literature. Some of these considerations are discussed in the
section of Chapter 3 that describes the conventional index of item
quality used in this study. The basic purpose of this section of
the report, however, is to review the literature that deals
explicitly with the use of conventional indices to identify faults in
individual test items.

If, as suggested in the previous section, some of the variation 
in test scores is attributable to faults in test items, the presence 
or absence of faults should affect the difficulty levels of individual 
test items. The use of item-difficulty indices to detect faulty items, 
however, is not straightforward because of two factors. First, a
considerable amount of variation in difficulty indices normally is
expected to occur as a result of the levels of abilities in examinees
with respect to the points being tested. Second, some types of
faults, such as the presence of an ambiguity in the stem of an item,
are likely to increase item difficulty while others, such as the
inclusion of implausible distracters, are likely to decrease an
item's difficulty from what it otherwise would be. In light of these
considerations, it is not surprising that the use of information on 
item difficulty as an aid in detecting faults is not recommended in 
the literature. 

Indices of choice attractiveness, on the other hand, apparently
are more helpful in the process of identifying faulty items.
Specifically, it has been suggested that distracters that are chosen by
very few or none of the examinees should be regarded as implausible
and be replaced (e.g., Adams, 1964, p. 357; Ahmann and Glock, 1971,
p. 192; Henrysson, 1971, pp. 130-137; and Thorndike and Hagen, 1969,
p. 127). Index 3, one of the new indices of item quality investigated
in this study, is designed to provide an overall indication of the
quality of each item with respect to the relative attractiveness of
its distracters.

Apparently, indices of the extent to which items discriminate
between those high in ability and those low in ability on some
criterion variable also may be used in identifying faulty items.
Numerous writers suggest that items that discriminate poorly should
be inspected closely for possible deficiencies (e.g., Anastasi,
1968, pp. 170-171; Davis, 1909, pp. 20-27; and Gulliksen, 1950,
p. 365). To aid in the process of inspecting questionable items,
it is commonly recommended that separate tabulations be made of
the number of high-ability examinees and low-ability examinees who
marked each choice. Illustrations of how faults may be detected
in this manner are presented in the literature for items that have
an unnecessary similarity between the keyed choice and the stem
(Ahmann and Glock, 1971, pp. 193-194); that have distracters that
may be too close in meaning to the keyed choice (Henrysson, 1971,
pp. 130-137); that are tricky (Ebel, 1965, p. 309); and that are
designed poorly (Ebel, 1965, p. 371).

Factors other than the presence or absence of faults may
cause discrimination indices to vary from item to item. Misinformation
on the part of examinees has been cited widely as one such
factor (e.g., Anastasi, 1968, p. 170; Davis, 1951, p. 300; Ebel,
1965, p. 372). Thus, despite the numerous individual illustrations
in the literature showing how discrimination indices may be used
to identify items with faults, their overall effectiveness as
measures of item quality is not clear. One of the contributions
of this study is that such a determination is made.



CHAPTER 3 



A STUDY OF THE VALIDITY OF MEASURES 
OF ITEM QUALITY 



Questions to Be Answered



Multiple-choice items of low quality can be found in standardized
achievement and aptitude tests despite the emphasis on avoiding
faults in the literature and the widespread use of the conventional
discrimination index in item-selection and revision procedures. In
light of this fact, a clear need exists to investigate systematically
the validity of the conventional measure of item quality as well as
to determine the usefulness of some promising new measures of item
quality. This study was undertaken to accomplish these objectives.
The conventional discrimination index was expressed in terms of the
Davis Discrimination Index. Three of the new indices selected for
investigation are based on choice-weight scores, and a fourth measures
the relative attractiveness of distracters. All five indices were
determined for each item in two parallel forms of a 27-item
arithmetic-reasoning test. The criterion for determining the validity
of the indices was the average rating of each item's quality by
three expert judges.

The data described above were obtained in order to answer the 
following specific questions: 

1a. What is the validity of the conventional discrimination
index for measuring item quality in each random half of the group
of examinees on each form of the test?

1b. For each form, is the average of the validity coefficients
obtained in the two halves of the examinees significantly different
from zero?

2a. What is the validity of the best-weighted combination of
the new indices for measuring item quality in each half of the sample
of examinees on each form of the test?

2b. What is the relative contribution of each new index to
the measurement of item quality in each half of the examinees on
each form?

2c. What is the cross-validated multiple-correlation
coefficient between the weighted composite of the new indices and the
criterion in each half of the examinees on the two forms?

2d. For each form, is the average of the two cross-validated
coefficients significantly different from zero?

3. For each form, is the average validity coefficient for
the conventional index for both halves of the examinees significantly
different from the average cross-validated multiple-correlation
coefficient for the two halves?
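
Questions 2a through 2d rest on a cross-validation design: weights for combining the new indices are derived from the data of one half of the examinees and then applied to the indices computed in the other half. The Python sketch below shows one way such a coefficient could be computed. It assumes ordinary least-squares weights with an intercept and relies on NumPy; the function names and the array layout (one row per item, one column per index) are illustrative assumptions and are not taken from the study.

```python
import numpy as np


def fit_weights(indices, criterion):
    """Least-squares weights (intercept first) for predicting the criterion.

    indices:   shape (n_items, n_indices), index values computed in one half
    criterion: shape (n_items,), e.g. the mean judged quality of each item
    """
    design = np.column_stack([np.ones(len(indices)), indices])
    weights, *_ = np.linalg.lstsq(design, criterion, rcond=None)
    return weights


def cross_validated_r(indices_fit, indices_apply, criterion):
    """Derive weights in one half, apply them to the other half, and
    correlate the resulting composite with the criterion."""
    weights = fit_weights(indices_fit, criterion)
    design = np.column_stack([np.ones(len(indices_apply)), indices_apply])
    composite = design @ weights
    return float(np.corrcoef(composite, criterion)[0, 1])
```

Calling `cross_validated_r(half_1, half_2, ratings)` and then `cross_validated_r(half_2, half_1, ratings)` would give the two coefficients per form whose average is the subject of question 2d.
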
Construction of the Special
Multiple-Choice Items

For the purposes of this study, it was necessary to write
items that would be heterogeneous with respect to their quality.
First, nine commonly recognized characteristics of items of high
quality were identified, as follows:

1. Presence of an adequate keyed choice;

2. Absence of distracters that can be defended as adequately
correct because of ambiguities in expressing the meanings
of the stem and the choices;

3. Absence of distracters that can be defended as adequately
correct when the stem and choices are unambiguous in
meaning;

4. Absence of ambiguity caused by the use of a negative or
double negatives;

5. Absence of distracters that are implausible because of
a lack of homogeneity with each other and with the keyed
choice;

6. Absence of distracters that are implausible when all
choices are relatively homogeneous and the presence of
naturally attractive distracters;

7. Absence of an extra-long or precisely worded keyed choice;

8. Absence of logically overlapping distracters;

9. Presence of grammatical agreement of the stem with the
choices.

Next, two arithmetic-reasoning items were written to conform
to the specifications represented by each of the nine characteristics.
Thus, eighteen items of high quality were made available.

Then, two arithmetic-reasoning items were written in such a
way as to make them slightly faulty with respect to each of the nine
characteristics. Thus, eighteen items of medium quality were made
available.

Finally, two arithmetic-reasoning items were written in such
a way as to make them seriously faulty with respect to each of the
nine characteristics.

The faults that were incorporated into the items needed to be 
of such a nature that they would not adversely affect the examinees'
motivation and acceptance of the tests as legitimate measures of 
arithmetic-reasoning ability. Hence, there was a practical re- 
striction on the extent to which the items could be made heter- 
ogeneous with respect to their quality. Three sample items are 
shown in Appendix A. 

In summary, there were eighteen items designed to be
"fault-free," eighteen designed to be moderately faulty, and eighteen
designed to be seriously faulty. The items were matched in terms
of the type and extent of fault, and one member of each matched
pair of items was randomly selected for inclusion in form A of the
test; the remaining item was included in form B. In constructing
the parallel forms, no attempt was made to match items in terms
of the specific arithmetic-reasoning skills that they were designed
to measure.
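
The assignment of matched pairs to the two forms can be sketched in Python as follows. This is only an illustration of the procedure described above: the item labels, the `assign_forms` name, and the use of a seeded random generator are assumptions, since the report does not say how the random selection was carried out.

```python
import random


def assign_forms(pairs, seed=None):
    """Randomly send one member of each matched pair to form A and the
    other member to form B."""
    rng = random.Random(seed)
    form_a, form_b = [], []
    for first, second in pairs:
        if rng.random() < 0.5:          # coin flip decides which member goes to A
            first, second = second, first
        form_a.append(first)
        form_b.append(second)
    return form_a, form_b


# Hypothetical labels: 9 characteristics x 3 quality levels = 27 matched pairs.
pairs = [((c, level, 1), (c, level, 2))
         for c in range(1, 10)
         for level in ("fault-free", "moderate", "serious")]
form_a, form_b = assign_forms(pairs, seed=1)
```

Each form then holds twenty-seven items, nine at each intended quality level.
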



Administration of the Special Items 
to the Validation Group

The two parallel forms of the arithmetic-reasoning test,
consisting of twenty-seven items each, were administered to
undergraduates who were applying for admission to the teacher credential
program during the fall quarter of 1971 at the California State
College, Los Angeles. As part of the application procedure, students
are administered a series of tests in various academic areas. They
were informed that the test used in this study was experimental and
was being administered in order to determine how well the test worked.
They were told, furthermore, that the experimental test would provide
them with practice in some of the skills that they would need to use
on an arithmetic skills-and-concepts test that would be administered
to them about a month later. The latter test is considerably easier
than the one used in this study and is used to determine eligibility
for the credential program. Observations of the examinees while
they were taking the experimental test indicated that they were well
motivated.

The two forms were administered separately with one week
between administrations. Ninety-nine of the examinees were present
for the administration of only one of the forms, and their responses
were excluded from all analyses. Since conventional item analysis
may not be meaningful if the data are obtained under speeded
conditions, the responses of the forty-two examinees who did not mark
at least one of the last three items on both forms were also
excluded. Consequently, the results reported in this study are
based upon the responses of 304 examinees who marked an answer to
at least one of the last three items on both forms.
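
The two exclusion rules can be expressed as a small filter. The Python sketch below is illustrative only: it assumes each examinee's responses to a form are held as a list with `None` for an omitted item, that an absent form is recorded as `None`, and that the speededness rule requires at least one marked answer among the last three items on each form. The `is_retained` name is hypothetical.

```python
def is_retained(form_a, form_b, tail=3):
    """Return True if an examinee's responses enter the analyses.

    form_a, form_b: list of marked choices in item order (None = omitted),
                    or None if the examinee missed that administration.
    """
    if form_a is None or form_b is None:
        return False                    # present for only one form
    # Require a marked answer among the last `tail` items on both forms.
    return all(any(r is not None for r in form[-tail:])
               for form in (form_a, form_b))
```
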



Computation of the Measures of the
Quality of the Special Items 

In order to compute the conventional discrimination index and
three of the four new measures of item quality, a criterion measure
of the examinees' overall ability in arithmetic reasoning was needed.
In this study, scores on the nine "fault-free" items in one parallel
form were used as the criterion in computing the indices for each
item in the other parallel form. These scores were corrected for
chance success. The parallel-forms reliability coefficients for
the two nine-item forms were found to be .502 and .579 in two
nonoverlapping random halves of the examinees. Since only two hours
of testing time were available, it was not possible to include a
larger number of items intended to be "fault-free."
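
The report does not state which correction for chance success was applied. A minimal Python sketch, assuming the conventional formula (rights minus wrongs divided by the number of choices less one) and five-choice items, is given below; the `corrected_score` name is hypothetical, and omitted items are assumed to count neither as right nor as wrong.

```python
def corrected_score(rights, wrongs, n_choices=5):
    """Score corrected for chance success: R - W / (k - 1).

    Assumes the conventional correction; omitted items are not counted
    in either `rights` or `wrongs`.
    """
    return rights - wrongs / (n_choices - 1)
```

Under this assumption, an examinee who marks six of the nine criterion items correctly, two incorrectly, and omits one would receive 6 - 2/4 = 5.5.
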
Conventional discrimination indices, which estimate the
degree of relationship between marking or not marking the keyed
choice and scores on the criterion variable, were obtained for each
item by means of an item-analysis computer program. This program
expresses the discrimination index in terms of a point-biserial
correlation coefficient. Since an "external" criterion was used
(i.e., scores on the nine items designed to be "fault-free" in a
separately administered parallel form), the spurious inflation of
these coefficients that would have occurred if part-whole
correlations had been used was precluded. It is widely recognized, however,
that the values of the point-biserial coefficient are related to
item difficulty. In order to obtain a measure of item discrimination
that is less related to item difficulty, the point-biserial
coefficients were converted to biserial correlation coefficients. The
biserial coefficients subsequently were converted to Davis
Discrimination Indices, which are described in detail elsewhere (Davis,
1949). The essential characteristics of these indices are that
their values constitute an interval scale and range from 0 to 100.
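
The first two steps of this chain can be sketched in Python using only the standard library. The sketch assumes the usual textbook formulas: the point-biserial coefficient as the standardized difference between the criterion means of those who did and did not mark the keyed choice, and the biserial coefficient obtained by multiplying the point-biserial by sqrt(pq)/y, where y is the ordinate of the unit normal curve at the point dividing proportions p and q. The function names are illustrative. The final conversion to Davis Discrimination Indices is not shown, because the report refers to Davis for the details of that conversion.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev


def point_biserial(marked_key, criterion):
    """Point-biserial r between marking the keyed choice (1/0) and the criterion."""
    p = mean(marked_key)                # proportion marking the keyed choice
    q = 1 - p
    mean_marked = mean(c for c, m in zip(criterion, marked_key) if m)
    mean_unmarked = mean(c for c, m in zip(criterion, marked_key) if not m)
    return (mean_marked - mean_unmarked) / pstdev(criterion) * sqrt(p * q)


def biserial_from_point_biserial(r_pb, p):
    """Convert a point-biserial r to a biserial r for pass proportion p."""
    unit_normal = NormalDist()
    ordinate = unit_normal.pdf(unit_normal.inv_cdf(p))
    return r_pb * sqrt(p * (1 - p)) / ordinate
```

The multiplier sqrt(pq)/y is smallest when p = .50 and grows as p moves toward 0 or 1, which is how the conversion offsets the dependence of the point-biserial coefficient on item difficulty.
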

The four new indices of item quality were computed. These
are:

Index 1 (I_1)

\[ I_1 = \sum_{i=2}^{k} (C_1 - C_i) \]

where

I_1 is Item Quality Index 1;

C_1 is the choice weight for the keyed choice;

C_i is the choice weight for choice i, where "omits" are
treated as choices and i ≠ 1;

k is the number of choices.

If the choice weights are on a 7-point scale from +3 to -3,
for five-choice items, the maximum value of Index 1 is +24 (where
C_1 = 3 and C_i = -3 for all values of i); the minimum value is -24. The
higher the value of I_1, the more likely it is that those who are
high on the criterion variable are attracted to the keyed choice
and that those who are low on the criterion variable are attracted
to the distracters. Thus, the value of I_1 for any given item
indicates the extent to which it differentiates between those
who know the answer and those who do not. It has long been an
accepted principle of item writing that items should make this
differentiation, and the extent to which they do is shown by
Index 1. From this basic principle are derived many specific
rules for item writing.
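
A minimal Python sketch of Index 1 follows. It assumes the index is the sum, over the unkeyed choices, of the keyed-choice weight minus each unkeyed-choice weight, and that the weights are supplied as a list with the keyed choice first; the `index_1` name and this list layout are illustrative assumptions.

```python
def index_1(weights):
    """Index 1: sum over unkeyed choices of (C_1 - C_i).

    weights[0] is the choice weight for the keyed choice (C_1);
    the remaining entries are the weights for the other choices.
    """
    keyed, *others = weights
    return sum(keyed - c for c in others)
```

With a keyed-choice weight of +3 and four unkeyed choices weighted -3, the function returns 24; reversing the signs gives -24.
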

Index 1m (I_1m)

\[ I_{1m} = C_1 - C_a \]

where

C_a is the choice weight for the most attractive distracter.

If the choice weights are on a 7-point scale from +3 to -3,
the maximum value of Index 1m is +6 and the minimum value is -6.
The higher the value of this index, the more likely it is that those
who are high in ability are attracted to the keyed choice and that
those who are low in ability are attracted to the most attractive
distracter. This index is a modification of Index 1 and was devised
after an inspection of the choice weights and frequencies for the
choices in several of the experimental items in form B. This
inspection revealed that in some items several distracters were
selected by very few examinees. In computing Index 1, the choice
weights for such ineffective distracters were given equal weight
with highly effective distracters. Index 1m is less subject to
this problem.
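
Index 1m can be sketched in Python as follows. The sketch assumes that the "most attractive" distracter is the unkeyed choice marked by the largest number of examinees and that both lists place the keyed choice first; the `index_1m` name is hypothetical. Ties are resolved in favor of the first distracter listed, which is an arbitrary choice made for this illustration.

```python
def index_1m(weights, frequencies):
    """Index 1m: C_1 - C_a, where a is the most frequently chosen distracter.

    weights[0] and frequencies[0] refer to the keyed choice.
    """
    distracters = range(1, len(weights))
    most_attractive = max(distracters, key=lambda i: frequencies[i])
    return weights[0] - weights[most_attractive]
```

For example, with weights `[3, -3, 0, 1, 2]` and frequencies `[50, 30, 2, 1, 0]`, only the second choice draws many examinees, so the index is 3 - (-3) = 6 and the three rarely chosen distracters do not affect it.
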



Index 2 (I_2)

\[ I_2 = \sum_{i=2}^{k} \sum_{j=3}^{k} |C_i - C_j| \qquad (i < j) \]

If the choice weights are on a 7-point scale from +3 to -3 for
five-choice items, the maximum value of I_2 is +24 (where C_2 and
C_3 = 3 and C_4 and C_5 = -3). The minimum value is zero (when all
values of C are the same). Therefore, the higher the value of I_2,
the more likely it is that the distracters are attracting groups
of subjects who differ with respect to their mean criterion scores.
The basic assumption underlying the formulation of this index is
that in an item of high quality, the distracters should discriminate
among those who don't have sufficient knowledge to select the correct
response but have varying amounts of information or misinformation.
That is, each distracter should attract examinees at a different
average level of ability than the other distracters. It is assumed
that items with this characteristic will be especially effective in
terms of providing plausible distracters for examinees who do not
thoroughly know the point in question. It should be noted, however,
that when the value of this index is at a maximum, the value of
Index 1 cannot be at a maximum. This restriction does not apply
to Index 1m.
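
The behavior of Index 2 can be checked with a short Python sketch. It assumes the index is the sum of the absolute differences between the choice weights of every pair of unkeyed choices, a reading that reproduces the maximum of 24 and the minimum of zero stated above. The function names and the list layout (keyed choice first) are illustrative; a one-line Index 1 is included only to show the restriction mentioned in the text.

```python
from itertools import combinations


def index_2(weights):
    """Index 2: sum of |C_i - C_j| over all pairs of unkeyed choices (i < j)."""
    return sum(abs(a - b) for a, b in combinations(weights[1:], 2))


def index_1(weights):
    """Index 1, assumed here to be the sum of (C_1 - C_i) over unkeyed choices."""
    return sum(weights[0] - c for c in weights[1:])


# Distracter weights that maximize Index 2 for a five-choice item.
weights = [3, 3, 3, -3, -3]
```

With these weights Index 2 reaches 24, while Index 1 is only 12 even though the keyed choice carries the highest possible weight, because two distracters must also carry weights of +3.
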



Index 3 (I_3)

\[ I_3 = -\sum_{i=2}^{k} \left| \frac{\sum_{j=2}^{k} f_j}{k - 1} - f_i \right| \]

where

f_1 is the frequency for the keyed choice;

f_i is the frequency for choice i, where "omits" are
treated as choices;

k is the number of choices.
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Whatever Hie percent of examinees who choose the keyed choice, 

1 3 will equal zero when equal, per cents mark all dis tractors . Its 
value will bo larger in the negative direction when this condition 
does not exist. Index 3, there Tore , is a measure of the extent to 
which the dis tractors in a multiple-choice item nvc equally attractive 
Horst (1.033) has shown that, other things being equal, item scores 
will tend to Ivj more reliable for items where the d is tractors are 
more equally attractive than in items where they are less equally 
attractive. This occurs regardless of the level o f difficulty. 
Furthermore, it is widely recommended that dis traders that attract 
very low or no examinees probably should be regarded as implausible 
and be replaced. Items with such dis tractors will tend to have 
larger negative values on Index 3 than items that do not. 
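
A minimal Python sketch of Index 3 follows. It assumes that the index is the negative sum of the absolute deviations of the distracter frequencies from their mean; the use of absolute (rather than squared) deviations is an assumption, as is the hypothetical function name `index_3` and the convention that the keyed choice comes first.

```python
def index_3(frequencies):
    """Negative sum of absolute deviations of distracter frequencies.

    Assumptions: frequencies[0] is the frequency for the keyed choice;
    the remaining entries are the distracter frequencies, with omits
    treated as a choice. Absolute deviations are assumed.
    """
    distracters = frequencies[1:]
    mean = sum(distracters) / len(distracters)
    return 0.0 - sum(abs(f - mean) for f in distracters)


# Equally attractive distracters: the index is zero.
equal = index_3([40, 15, 15, 15, 15])
# One distracter draws most of the wrong answers: the index is negative.
unequal = index_3([40, 45, 10, 5, 0])
```

The frequency for the keyed choice does not enter the computation, which matches the statement that the index equals zero whatever percent of examinees choose the keyed choice.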

Judgments of the Quality
of the Special Items



In order to obtain a criterion to use in determining the
validity of the objective measures of item quality, three judges
were asked to rate independently each of the fifty-four special
items for quality by using a special check list of the nine
characteristics of items discussed earlier (see Appendix B).*
Specifically, the judges were asked to indicate which, if any,
faults were present in each item and the extent to which each
fault would be likely to affect adversely a given item's ability
to discriminate between those who know and those who do not know
the point in question. The extent to which each item's ability to
discriminate was impaired by each fault was indicated on a three-
point scale consisting of these categories: "not detrimental,"
"moderately detrimental," and "seriously detrimental." Furthermore,
the judges were asked to explain the nature of each fault that they
found.

Originally, it was planned to give each item a score of one
point for each moderately detrimental fault and a score of two
points for each seriously detrimental fault. In 20 of the 171
ratings of individual items, however, a given judge gave the same
explanation for marking two or more faults for a given item. This
occurred most often in response to scales five and six even though
these scales were worded in a manner designed to preclude this
occurrence. This raised the problem of whether an item should
accumulate points under various headings on the check list for a
single characteristic. It finally was decided that whenever two
or more faults were marked for a given item by a single judge and
the same explanation was given for the various faults, the multiple
faults would be counted only once. It is interesting to note, in
retrospect, that this problem could have been avoided by having the
judges check off the faults that they found in each item, but give
only one overall rating of the likely effects of all faults on
each item's ability to discriminate.

*Charlotte Croon Davis, Test Research Service; Gordon Fifer, Hunter
College, City University of New York; and Mary B. Willis, American
Institutes for Research, Palo Alto, served as the judges of the
quality of the items.

The scores obtained by each item were averaged in order to
obtain a single criterion measure of item quality. To make higher
average values indicate higher quality than lower average values,
the average scores for each item were subtracted from a constant
positive number that was larger than any of the average ratings.
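
The scoring rules described above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the constant 5.0 is a placeholder (the constant actually used is not reported here), and when faults sharing one explanation carry different severities the larger number of points is kept, a detail the text does not specify.

```python
# Points per category on the three-point scale.
POINTS = {"not": 0, "moderate": 1, "serious": 2}


def fault_score(marked_faults):
    """Score one judge's check list for one item.

    marked_faults is a list of (severity, explanation) pairs. Faults that
    share the same explanation are counted only once; keeping the larger
    number of points among them is an assumption.
    """
    points_by_explanation = {}
    for severity, explanation in marked_faults:
        current = points_by_explanation.get(explanation, 0)
        points_by_explanation[explanation] = max(current, POINTS[severity])
    return sum(points_by_explanation.values())


def item_quality_rating(judge_scores, constant=5.0):
    """Average the judges' scores and reverse the scale.

    constant is a hypothetical value larger than any average score.
    """
    return constant - sum(judge_scores) / len(judge_scores)


judge_a = fault_score([("moderate", "stem is ambiguous"),
                       ("moderate", "stem is ambiguous")])
judge_b = fault_score([("serious", "keyed choice is inadequate")])
judge_c = fault_score([])
rating = item_quality_rating([judge_a, judge_b, judge_c])
```

Subtracting the average from a constant only reverses the direction of the scale; it does not change the correlations of the ratings with the indices except in sign.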

Despite the relatively minor problem that arose in obtaining
scores from the ratings, inspection of them reveals that the judges
possess considerable insight into the desirable characteristics of
multiple-choice items and the probable reactions of examinees to
them. Furthermore, the reliability coefficients for the average of
the judges' ratings were computed to be .007 and .857 for Forms A
and B, respectively, which are high considering the types of judg-
ments involved.

CHAPTER 4



THE FINDINGS



For each of the specially prepared items, six scores were
available: Indices I1, I1m, I2, and I3; the Davis Discrimination
Index (DDISC); and the Item-Quality Rating (IQR) obtained by
averaging the ratings of the three judges. The sample of examinees
was divided in half at random for subsequent cross validation, and
the indices were computed separately on the basis of the responses
of random halves (I and II) of the examinees on each form (A and B)
of the test.

The criterion used in computing Indices I1, I1m, I2, and the
Davis Discrimination Index for each item consisted of the scores
on the nine items intended to be "fault-free" on the parallel form
of the test. The mean scores corrected for chance success on the
nine items on Form A of the test were 3.30 and 2.80 for halves I
and II of the examinees; the associated standard deviations were
2.37 and 2.32, respectively. On Form B, the mean corrected scores
on the nine items were 2.23 and 1.77, and the standard deviations
were 2.05 and 2.41 in halves I and II, respectively.

The mean corrected scores on all twenty-seven items on Form A
of the test were 8.02 and 0.31 for the two halves of the examinees;
the associated standard deviations were found to be 4.40 and 4.35,
respectively. On Form B, the mean corrected scores on all items
were 8.00 and 0.78, and the standard deviations were 0.70 and 4.40.

The first step in the analysis of the data was to obtain the
intercorrelations of the six indices of item quality separately for
each random half of the examinees on each form of the test. Tables
1 through 4 present the intercorrelations along with the means and
standard deviations of the variables. The columns labeled "IQR"
show the validity coefficients for the indices. Inspection of the
scatter plots for these relationships indicates that they are not
curvilinear.



The Validity of the Conventional 
Index of Item Quality 

The conventional discrimination index was expressed in terms
of the Davis Discrimination Index (DDISC). For Form A of the test,
the validity coefficients for this index were .540 and .418 for the
two random halves of the examinees. Using the appropriate z trans-
formation, the average of these coefficients was .485. This value
is significantly different from zero at the .01 level.

                              TABLE 1

   INTERCORRELATIONS, MEANS, AND STANDARD DEVIATIONS FOR THE SIX
        VARIABLES FOR RANDOM HALF I ON FORM A OF THE TEST.

          I1     I1m     I2      I3    DDISC    IQR        M        SD

I1              .075   -.285    .370    .829    .482    53.185    32.550
I1m                     .089    .545    .901    .489     6.920     7.817
I2                             -.082   -.100    .207    77.111    90.410
I3                                      .623    .120   -59.815    99.002
DDISC                                           .540    15.852    11.930
IQR                                                      3.444     1.207


                              TABLE 2

   INTERCORRELATIONS, MEANS, AND STANDARD DEVIATIONS FOR THE SIX
        VARIABLES FOR RANDOM HALF II ON FORM A OF THE TEST.

          I1     I1m     I2      I3    DDISC    IQR        M        SD

I1              .020    .184   -.048    .771    .486    62.630    38.488
I1m                    -.191    .450    .876    .486     9.296     9.033
I2                             -.431   -.226    .121    91.593    41.284
I3                                      .456    .071   -52.148    50.430
DDISC                                           .418    16.889    12.673
IQR                                                      3.444     1.207


                              TABLE 3

   INTERCORRELATIONS, MEANS, AND STANDARD DEVIATIONS FOR THE SIX
        VARIABLES FOR RANDOM HALF I ON FORM B OF THE TEST.

          I1     I1m     I2      I3    DDISC    IQR        M        SD

I1              .827    .010    .102    .807    .417    58.000    38.530
I1m                    -.208    .343    .030    .593     9.518     9.658
I2                             -.207   -.188   -.473    90.503    01.130
I3                                      .301    .423   -50.063    30.603
DDISC                                           .572    15.333    11.820
IQR                                                      3.700      .987


                              TABLE 4

   INTERCORRELATIONS, MEANS, AND STANDARD DEVIATIONS FOR THE SIX
        VARIABLES FOR RANDOM HALF II ON FORM B OF THE TEST.

          I1     I1m     I2      I3    DDISC    IQR        M        SD

I1              .730    .011    .157    .870    .360    60.290    32.592
I1m                     .330    .305    .922    .414    10.778     9.378
I2                             -.232   -.331   -.287    85.037    35.596
I3                                      .182    .356   -97.518    91.517
DDISC                                           .488    18.222    11.789
IQR                                                      3.790      .987


For Form B of the test, the validity coefficients for the
Davis Discrimination Index were .488 and .572 for the two halves
of the examinees. The average of these coefficients was .530,
which is significantly different from zero at the .01 level.

In summary, the conventional discrimination index was posi-
tively correlated with the criterion variable at the .01 level.
It is important to note, however, that approximately three-quarters
of the variation of item quality, as determined by judges' ratings,
remained unexplained by this index.
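
The averaging of two validity coefficients through the z transformation, and the proportion of criterion variance left unexplained, can be sketched as follows. The function name is hypothetical, and the example reads the two Form B coefficients for the Davis Discrimination Index as .488 and .572, which is an assumption about the source values.

```python
import math


def average_correlation(r_values):
    """Average correlations by way of Fisher's z (hyperbolic arc tangent)."""
    z_values = [math.atanh(r) for r in r_values]
    return math.tanh(sum(z_values) / len(z_values))


# Assumed Form B validity coefficients for the two random halves.
form_b_average = average_correlation([0.488, 0.572])

# Proportion of variance in the judges' ratings not explained by the index.
unexplained = 1 - form_b_average ** 2
```

Under these assumptions the average is close to the reported .530, and the unexplained proportion is a little under three-quarters.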



The Validity of the New Indices 
of Item Quality 

Tables 1 and 2 show the intercorrelations, means, and standard
deviations of the indices for the two random halves of the examinees
on Form A of the test. With respect to this form, all of the new
indices have positive validity coefficients. Only the coefficients
for Indices I1 and I1m, however, are of appreciable size.

Tables 3 and 4 show the intercorrelations, means, and standard
deviations of the indices for the two random halves of the examinees
on Form B of the test. With respect to this form, I2 has negative
validity coefficients. Possible reasons for this unexpected finding
and suggestions for a future study of this index are discussed in
the next chapter.

The multiple-correlation coefficients between the best-weighted
combination of I1, I1m, I2, and I3 and the Item-Quality Rating for
the two halves of the examinees on Form A were .693 and .632, respec-
tively. On Form B the multiple-correlation coefficients for the two
halves of the examinees were .689 and .594, respectively.

To eliminate the capitalization on chance elements that causes
spurious inflation of multiple-correlation coefficients, cross-
validated correlation coefficients were obtained by using the beta
weights obtained in half I of the sample with the intercorrelations
and validity coefficients of the variables in half II of the sample.
Likewise, the beta weights obtained in half II of the sample were
used with the intercorrelations and validity coefficients of the
variables in half I of the sample.

Strictly speaking, the resulting cross-validated coefficients
are product-moment correlation coefficients between standard mea-
sures in the criterion variable (denoted in the following equations
as c) and a weighted sum of standard measures in each of the pre-
dictor variables (the four indices, denoted in the following equations
as variables 1, 2, 3, and 4), where the weights are the partial re-
gression coefficients in standard-measure form (beta weights, denoted
in the following equations as β1, β2, β3, and β4). The equation for
obtaining the cross-validated coefficients, written for sample I
intercorrelations and validity coefficients and sample II beta
weights in Form A of the test, follows:

$$
{}_{A}r_{({}_{I}z_c)({}_{II}\beta_1\,{}_{I}z_1 + {}_{II}\beta_2\,{}_{I}z_2 + {}_{II}\beta_3\,{}_{I}z_3 + {}_{II}\beta_4\,{}_{I}z_4)}
= \frac{{}_{II}\beta_1\,{}_{I}r_{c1} + {}_{II}\beta_2\,{}_{I}r_{c2} + {}_{II}\beta_3\,{}_{I}r_{c3} + {}_{II}\beta_4\,{}_{I}r_{c4}}
{\sqrt{\begin{aligned}
&{}_{II}\beta_1^2 + {}_{II}\beta_2^2 + {}_{II}\beta_3^2 + {}_{II}\beta_4^2 \\
&+ 2\,({}_{II}\beta_1\,{}_{II}\beta_2\,{}_{I}r_{12} + {}_{II}\beta_1\,{}_{II}\beta_3\,{}_{I}r_{13} + {}_{II}\beta_1\,{}_{II}\beta_4\,{}_{I}r_{14} \\
&\quad + {}_{II}\beta_2\,{}_{II}\beta_3\,{}_{I}r_{23} + {}_{II}\beta_2\,{}_{II}\beta_4\,{}_{I}r_{24} + {}_{II}\beta_3\,{}_{II}\beta_4\,{}_{I}r_{34})
\end{aligned}}}
\tag{1}
$$

(dof = n_I - 2)



Analogous equations provide similar data for sample II intercor-
relations and validity coefficients used with sample I beta weights for
the 27 items in Form A of the test; sample I intercorrelations and
validity coefficients used with sample II beta weights for the 27 items
in Form B of the test; and sample II intercorrelations and validity
coefficients used with sample I beta weights for the 27 items in Form B
of the test.
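
The cross-validation computation described above can be sketched in Python. The function name, argument names, and the numbers in the example are hypothetical and are not values from the study; the sketch assumes only the standard formula for the correlation between a criterion and a weighted sum of standardized predictors.

```python
import math


def cross_validated_r(betas, validities, intercorrelations):
    """Correlation of the criterion with a weighted sum of standard scores.

    betas come from one random half; validities (r_c1 ... r_cp) and
    intercorrelations (a dict keyed by (i, j) with i < j, zero-based)
    come from the other half.
    """
    numerator = sum(b * r for b, r in zip(betas, validities))
    variance = sum(b * b for b in betas)
    for (i, j), r_ij in intercorrelations.items():
        variance += 2 * betas[i] * betas[j] * r_ij
    return numerator / math.sqrt(variance)


# Illustrative values only: two uncorrelated predictors.
example = cross_validated_r(
    betas=[0.5, 0.5],
    validities=[0.6, 0.4],
    intercorrelations={(0, 1): 0.0},
)
```

With four predictors the dictionary would hold the six intercorrelations r12, r13, r14, r23, r24, and r34, matching the six cross-product terms under the radical.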

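The results of these computations are as follows:

$${}_{A}r_{({}_{I}z_c)({}_{II}\beta_1\,{}_{I}z_1 + \dots + {}_{II}\beta_4\,{}_{I}z_4)} = .615 \qquad (\text{dof} = n_I - 2) \tag{2}$$

$${}_{A}r_{({}_{II}z_c)({}_{I}\beta_1\,{}_{II}z_1 + \dots + {}_{I}\beta_4\,{}_{II}z_4)} = .489 \qquad (\text{dof} = n_{II} - 2) \tag{3}$$

$${}_{B}r_{({}_{I}z_c)({}_{II}\beta_1\,{}_{I}z_1 + \dots + {}_{II}\beta_4\,{}_{I}z_4)} = .607 \qquad (\text{dof} = n_I - 2) \tag{4}$$

$${}_{B}r_{({}_{II}z_c)({}_{I}\beta_1\,{}_{II}z_1 + \dots + {}_{I}\beta_4\,{}_{II}z_4)} = .496 \qquad (\text{dof} = n_{II} - 2) \tag{5}$$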

Before obtaining the average cross-validated coefficients for
Form A and Form B, it should be determined whether the coefficients
yielded by equations 2 and 3 are significantly different and whether
those yielded by equations 4 and 5 are significantly different. The
.05 level of significance was used in making this decision.

All four coefficients of interest are product-moment correlation
coefficients; consequently, they may legitimately be converted to
Fisher's z statistics (the hyperbolic arc tangent). The appropriate
t test, expressed in notation appropriate for testing the significance
of the difference between the two coefficients based on nonoverlapping
samples who took Form A, is:

$$t = \frac{{}_{A}z_{I} - {}_{A}z_{II}}{\sqrt{\dfrac{1}{n_I - 3} + \dfrac{1}{n_{II} - 3}}} \tag{6}$$

(dof = n_I + n_II - 6)

where {}_{A}z_{I} and {}_{A}z_{II} are the Fisher z statistics for the
coefficients given by equations 2 and 3 and where, in this case,
n equals the number of items in Form A.

Use of equation 6 indicates that the cross-validated corre-
lations of the best-weighted combinations of the four indices and
the criterion variable (the judges' ratings) for Form A were not
significantly different at the .05 level. Similarly, use of the
appropriate analogue of equation 6 shows that the two validity
coefficients for Form B are not significantly different at the .05
level. Consequently, it is legitimate to combine the data for the
two coefficients pertaining to Form A and to Form B to obtain one
product-moment coefficient showing, for each form, the extent to
which the best-weighted combination of standard measures corres-
ponding to the four item indices correlates with the criterion
variable.
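
The test of the difference between two correlations from nonoverlapping samples can be sketched as follows. The function name and the two correlations in the example (.60 and .50) are hypothetical; only the sample size of 27 items per form is taken from the text.

```python
import math


def z_difference_statistic(r1, n1, r2, n2):
    """Difference between two Fisher z values divided by its standard error."""
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / standard_error


# Illustrative correlations for two halves, each based on 27 items.
statistic = z_difference_statistic(0.60, 27, 0.50, 27)
```

With only 27 items in each half, the standard error of the difference is large, so even a difference of .10 between two correlations yields a statistic well below conventional critical values.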

The required equation for the within-group correlation
coefficient, expressed in terms of Fisher's z statistic and
written in notation appropriate for Form A (samples I and II),
is as follows:

$$z_{\text{within-group}} = \frac{(n_I - 3)\,z_I + (n_{II} - 3)\,z_{II}}{(n_I - 3) + (n_{II} - 3)} \tag{7}$$

(dof = n_I + n_II - 3 - 3)

When equation 7 and its analogue for use with Form B are used,
the weighted combination of standard measures of the four indices
yields correlations with the criterion of .544 for Form A and of .592
for Form B.
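
The pooling of two correlations through a weighted average of their z statistics can be sketched as follows. The function name and the example correlations are hypothetical; the weights n - 3 follow the equation for the within-group coefficient.

```python
import math


def pooled_correlation(r1, n1, r2, n2):
    """Weighted average of two correlations by way of Fisher's z.

    Each z value is weighted by its sample size minus three.
    """
    w1 = n1 - 3
    w2 = n2 - 3
    z = (w1 * math.atanh(r1) + w2 * math.atanh(r2)) / (w1 + w2)
    return math.tanh(z)


# Illustrative correlations for two halves, each based on 27 items.
pooled = pooled_correlation(0.60, 27, 0.50, 27)
```

When the two halves contain the same number of items, as in this study, the weights are equal and the result is the simple average of the two z values converted back to a correlation.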

Since these are product-moment correlation coefficients, each
with 48 degrees of freedom, the difference between each of them and
a true coefficient of zero may be tested with the usual equation:

$$t = r\,\sqrt{\frac{N - 2}{1 - r^2}} \tag{8}$$



Use of equation (8) indicates that the correlations for both
Form A and Form B are significantly different from zero at the .01
level.
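
The significance test for a single correlation can be sketched as follows. The sketch assumes 48 degrees of freedom for each pooled coefficient (27 + 27 - 6) and reads the pooled coefficients as .544 and .592; both are assumptions about the source values, and the function name is hypothetical.

```python
import math


def t_for_correlation(r, degrees_of_freedom):
    """t statistic for testing a correlation against zero."""
    return r * math.sqrt(degrees_of_freedom / (1 - r * r))


# Assumed pooled coefficients and degrees of freedom.
t_form_a = t_for_correlation(0.544, 48)
t_form_b = t_for_correlation(0.592, 48)
```

Under these assumptions both statistics are well above 4, which agrees with the conclusion that the coefficients differ significantly from zero.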

The beta weights for obtaining the best-weighted combinations
of the new indices for Random Halves I and II on Form A of the test
were computed to be:

                     I1      I1m     I2      I3

   Random Half I    .448    .080    .421    .258
   Random Half II   .204    .375    .147   -.028

These weights must be interpreted with caution because the size of
the weight for each index is dependent upon the relationships among
the particular predictors that were used in this study as well as
its validity. Thus, in a strict sense, these weights indicate the
relative contribution of each of the new indices only when all
these and only these predictors are used.

A measure of the importance of each predictor that is inde-
pendent of the relationships among a given set of predictors is
the squared validity coefficient. This indicates the amount of
variation in the criterion scores that each predictor independently
explains. For the two halves of the examinees on Form A of the
test these are:

                     I1      I1m     I2      I3

   Random Half I                    .071    .010
   Random Half II                   .015    .005

These indicate that in absolute terms both Indices I1 and I1m are
relatively good predictors of item quality and are about equally
effective. The beta weights examined previously show, however, that
when these two are used in the set of four new predictors, they are
differentially effective because of the nature of the relationships
among the predictors. These relationships are shown in Tables 1 and
2. It is also interesting to note that for the first half of the
examinees, I2 received a relatively large beta weight even though
its squared validity coefficient is low. For the second half, I2
is not particularly important by either measure. I3, furthermore,
does not appear to be particularly effective in terms of its squared
validity coefficients.

The beta weights for obtaining the best-weighted combination
of the new indices for Random Halves I and II on Form B of the test
are:

                     I1      I1m     I2      I3

   Random Half I    .080    .302   -.305    .180
   Random Half II   .094    .300   -.127    .284

These indicate the relative importance of the new predictors when
all these and no additional predictors are used to obtain the best-
weighted combination of the predictors.

The squared validity coefficients for these predictors with
respect to Form B are:

                     I1      I1m     I2      I3

   Random Half I    .174    .352    .224    .179
   Random Half II   .130    .171    .082    .127

These indicate the relative importance of the predictors if each is
to be used alone.



Comparison of the Validity of the
New Indices with the Conventional Index

For Form A of the test, the average of the validity coeffi-
cients obtained for the two halves of examinees for the conventional
discrimination index was found to be .485. The average cross-
validated multiple-correlation coefficient that indicates the
validity of the weighted combination of the new indices for this
form was found to be .544. The coefficients of determination indi-
cate that, on the average, the weighted combination of the new
indices explains thirty percent of the criterion variance while the
conventional index explains twenty-one percent.

With respect to Form B, the average of the validity coeffi-
cients obtained for the two halves of examinees for the conventional
index was found to be .530. The average cross-validated
multiple-correlation coefficient between the weighted combination
of the new indices and the criterion was found to be .592. For this
form, the new indices, on the average, explain thirty-five percent
of the criterion variance while the conventional index explains
twenty-eight percent.

In conclusion, the validity of the weighted combination of the
new indices appears to be somewhat better than the validity of the
conventional discrimination index. However, Index 2 had negative
validity coefficients with respect to Form B. Consequently, the
nature of this index's contribution to the weighted combination of
new indices is not clear from a logical or theoretical point of view.
In light of this fact, it was decided not to determine statistically
the significance of the differences in validities of the weighted
combinations and the conventional index for each form. That is,
even if the differences were shown to be statistically significant,
they would not be of any practical significance without a rather
thorough understanding of the nature of the superior indices. This
point is discussed in greater detail in the next chapter.






CHAPTER 5 



SUMMARY, DISCUSSION, 
AND CONCLUSIONS 



Summary

Despite the attention given to avoiding faults in multiple-choice
items, faulty items continue to appear in standardized tests.
Consequently, it seemed desirable to determine the validity of
four new measures of item quality. Three of these were based on
analyses of empirical weights for item choices, and the fourth was
based upon the relative attractiveness of the distracters in items.

Multiple-choice items designed to vary in quality with respect to
nine specific item-writing principles were prepared. The quality of
each item was rated independently by three judges, who were also
asked to describe each fault that they found.

The conventional discrimination index was computed for each item
and was expressed in terms of the Davis Discrimination Index. The
criterion of examinee ability used to compute it for each item in a
given form consisted of the scores on the nine "fault-free" items
in the parallel form.

The new indices of item quality were also computed for each
item, with the scores on the nine "fault-free" items in the parallel
form as the criterion of examinee ability. Index 2 reflects the
extent to which the distracters discriminate among the examinees who
do not thoroughly know the point in question, and Index 3 reflects
the extent to which the distracters are equally attractive.

For each of the specially prepared items, six scores were
available: the Item-Quality Rating obtained by averaging the
ratings of the three judges, the Davis Discrimination Index, and the
four new indices. The sample of examinees was divided in half at
random for subsequent cross validation, and the indices were computed
separately on the basis of the responses of each random half of the
examinees (I and II) on each form (A and B).

The first step of the analysis was to obtain the intercorre-
lations of the variables named above. Inspection of the correlations
between each of the indices and the Item-Quality Rating indicates
that the Davis Discrimination Index and Indices I1 and I1m are moder-
ately valid measures of item quality. Furthermore, the intercorre-
lations among these three indices are strong and positive.

Index 2, on the other hand, appears to be operating differently
from the way expected. In fact, with respect to Form B, the values
of this index were negatively related to the Item-Quality Rating.
Possible reasons for this result are discussed in the next section
of this chapter, as well as suggestions for future studies of this
index.

The validity coefficients for Index 3 are in the expected
direction but are disappointingly small. This finding is discussed
in detail in a later section of this chapter.

The second step in the analysis was to determine the multiple-
correlation coefficients of the new indices with the Item-Quality
Rating for each half of the examinees on each form. The cross-
validated multiple-correlation coefficients indicate that the
weighted combination of the new indices is a moderately valid
predictor of item quality. The usefulness of this result is severely
limited by the fact that the way Index 2 operated in this study is
not fully understood. That is, given the results of this study,
there is no theoretical or logical basis for using this index as a
measure of item quality.

In conclusion, the conventional discrimination index appears
to be a moderately valid measure of item quality. A substantial
amount of the variance in the judges' ratings, however, remains
unexplained by the index. This suggests that further research on
other indices of item quality is desirable. Furthermore, the new
indices investigated in this study, in general, appear to be
promising measures. Further research will be needed, especially on
Index 2, before recommendations can be made regarding the use of the
new indices in operational settings.



Discussion of Factors to be Considered 
in Future Studies of Index 2 



With respect to Form A of the test, the correlation coeffi-
cients between Index 2 and the average of the judges' ratings for
the two halves of the examinees were positive but weak. On Form B,
the relationships between I2 and the criterion were negative, and
for one random half of the examinees, the negative relationship was
substantial in size. These findings were disappointing since strong
positive relationships were expected. Consequently, it is desirable
to reexamine the assumptions used in the formulation of this index
and the methods used in this study to determine its validity.

The basic assumption used in the formulation of Index 2 is
that each distracter should attract examinees at a different average
level of ability than the other distracters. It is interesting to
note that this assumption is compatible with the widespread assumption
that the right-or-wrong distinction for a given item is an arbitrary
dichotomy and that the ability of examinees with respect to the point
in question is in reality normally distributed. If this latter
assumption is true, it should be possible to write an item to test
a given point in which the distracters attract examinees at different
levels along a continuum of ability. Logically, such an item should
be especially effective in providing plausible alternatives for
examinees who do not have adequate information to select the correct
choice. In retrospect, therefore, the basic assumption underlying
Index 2 still seems reasonable.

Inspection of data for individual items, however, indicates
that large differences in the choice weights for the distracters
may occur as a result of several types of faults in items. With
respect to this possibility, consider the choice weights for the
second item shown in Appendix A. Distracter C, on the average,
attracted examinees at a higher level of ability than the keyed
choice. Close inspection of the item indicates that there is an
ambiguity in the stem that makes choice C defensible as the correct
answer. Consequently, the large weight for distracter C appears to
be the result of a fault in the item, and when the weight for this
distracter is subtracted from the weights for the other distracters,
large remainders are obtained, which increase the value of I2. The
undesirable influence of certain kinds of faults such as that dis-
cussed above could be controlled, to some extent, by ignoring the
choice weight for any distracter that has a larger weight than the
keyed choice in the computation of Index 2. The assumption under-
lying this provision for a modification of Index 2 is that the
weight for the keyed choice in a given item should be larger than
the weights for any of the distracters, which is the basic assump-
tion for Indices 1 and 1m.
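
The suggested modification of Index 2 can be sketched as follows. The function names and the weights in the example are hypothetical and are not taken from Appendix A; the sketch assumes that Index 2 is the sum of absolute differences between all pairs of distracter weights and that the keyed choice is listed first.

```python
from itertools import combinations


def index_2(weights):
    """Sum of absolute differences between all pairs of distracter weights."""
    return sum(abs(a - b) for a, b in combinations(weights[1:], 2))


def modified_index_2(weights):
    """Index 2 computed after dropping any distracter whose weight
    is larger than the weight of the keyed choice (weights[0])."""
    keyed = weights[0]
    retained = [w for w in weights[1:] if w <= keyed]
    return sum(abs(a - b) for a, b in combinations(retained, 2))


# Illustrative item: the second distracter outweighs the keyed choice.
weights = [60, 45, 75, 50, 40]
original = index_2(weights)
modified = modified_index_2(weights)
```

In this illustration the single distracter that outweighs the keyed choice accounts for most of the original value, which is the kind of inflation the modification is meant to remove.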

In retrospect, it seems possible that the difficulty levels
of the items may have had an undue influence on the rank order of
the items on Index 2. Specifically, it is unlikely that the value
of I2 will be large for a very easy item, regardless of its quality,
since those who do not know the point in such an item probably
represent a narrow range of ability. Although the items in this
study were, on the average, rather difficult, there was considerable
variation in the difficulty of the items, and this variation may
have accounted for a substantial amount of the variation in the
values of I2.

Finally, weaknesses in the criterion used to determine the
validity of Index 2 may have contributed to the negative results.
The judges were asked to determine whether nine types of faults
were present in the items and the extent to which each fault probably
would affect the item's ability to discriminate between those who
do and those who do not know the points in question. While this
seems to be a reasonable criterion of the overall quality of test
items, it does not deal directly with the characteristics of items
that are likely to influence the values of I2. Specifically, the
directions to the judges emphasized the quality of the keyed choice
and its relationship to the distracters and stem of a given item,
rather than emphasizing the relationships among the distracters in
terms of their relative effects on examinees. A rating scale that
emphasized the latter relationships may have led to different average
ratings for at least some of the items. Consider, for example,
ratings that dealt with the plausibility of distracters on scales
five and six in the present study. An implausible distracter in a
given item was likely to lead to a low quality rating for that item.
Yet, in terms of the considerations underlying the formulation of
Index 2, a single implausible distracter does not necessarily reduce
the quality of an item as long as the distracter is effective in
attracting some of the examinees and as long as other distracters
are present that are effective in attracting examinees at higher
levels of ability. It is interesting to note that one implausible
distracter was incorporated into each item in an early study of
choice-weight scoring (Nedelsky, 1954). Such distracters were in-
cluded in order to identify examinees at very low levels of ability.

In summary, despite the disappointing results regarding
Index 2 in this study, the index still appears to be reasonable from
a subjective point of view and probably deserves further investiga-
tion. In future studies, it is suggested that in computing Index 2,
choice weights for distracters in a given item that are larger than
the choice weight for the keyed choice should not be used. Further-
more, it is suggested that the items in such a study should be
relatively homogeneous with respect to difficulty and be of medium
or greater difficulty. Finally, the criterion of item quality should
be redefined in terms of the basic assumptions underlying the index.



Discussion of Factors to be Considered
in Future Studies of Index 3

The validity coefficients for Index 3, which is a measure of
the relative attractiveness of the distracters in a given item, were
not as large as expected. In the present study, at least two factors
may have led to the poor results. First, the values of this index
may have been unduly influenced by the variation in item difficulty.
That is, the fact that frequencies were used in computing this index
makes its values dependent, to some extent, upon the number of people
who mark the item incorrectly. Specifically, it is not possible for
an easy item to assume a large negative value on I3, regardless of
its quality, since the average number of examinees that mark dis-
tracters in the item will be low, and the deviations from this value
must be small. The importance of this restriction was not recognized
when this study was planned.
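
The dependence of Index 3 on item difficulty can be illustrated with a short sketch. It assumes that the index is the negative sum of absolute deviations of the distracter frequencies from their mean, and it computes the most negative value the index can take when every wrong answer falls on a single distracter. The function name and the example counts are hypothetical.

```python
def most_negative_index_3(n_wrong, n_distracters):
    """Most negative value of Index 3 for a given number of wrong answers.

    Assumption: all wrong answers fall on one distracter and the index
    is the negative sum of absolute deviations from the mean frequency.
    """
    mean = n_wrong / n_distracters
    deviation_of_chosen = n_wrong - mean
    deviation_of_others = (n_distracters - 1) * mean
    return -(deviation_of_chosen + deviation_of_others)


# Easy item: 10 of 100 examinees answer incorrectly, four distracters.
easy_item_bound = most_negative_index_3(10, 4)
# Hard item: 70 of 100 examinees answer incorrectly, four distracters.
hard_item_bound = most_negative_index_3(70, 4)
```

Under this assumption the bound is proportional to the number of wrong answers, so an easy item cannot reach the large negative values available to a difficult item, whatever the quality of its distracters.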

Furthermore, the criterion used to determine the validity of
the index did not deal directly with the relative quality of the
distracters in terms of their attractiveness, but rather with the
quality of the keyed choice and its relationships to the distracters
and the stem. Consequently, it is suggested that in future studies
of the validity of Index 3, items be used that are relatively homo-
geneous with respect to difficulty and that a criterion be employed
that deals more directly with the quality of the distracters.



Conclusions

Several general conclusions seem appropriate as a result of
this study. First, the conventional discrimination index appears
to be a reasonably effective measure of item quality. In this
study, much of the variation in item quality, however, remained
unexplained by this index. Secondly, the new indices appear to be
promising as measures of item quality. Additional research, however,
is needed in order to fully understand them and to determine their
value in regular test-construction projects.

SAMPLE ITEMS DESIGNED TO VARY
IN QUALITY AND THEIR CHOICE WEIGHTS



I terns 

A. 'Tan I t- free " 



WoJ»hl:s 

S amni o jj 



1. 



White tile costs 11 cents per 9-inch
square, while colored tile costs 13
cents for a square of the same size.
How much more will it cost to cover

cLV , ? 0P .°f 9 S,,0WCV 1,ooni 3Cl feet bv 

1. 1 3e^° L U:I l l ooloved i»»«t-cad of white 



A. $5.70. 



B. $11.52 


GO 


57 


C. $21.87... 


<10 


58 


D. $3*1 . 56 


52 


50 


* E. $G9 . 12 


50 


59 


Omit 


75 


80 


Moderately faulty 
(ambiguous stem) 


55 


GO 



2 . 



Milk which sells for 20 cents a quart
is on sale for 70 cents a gallon. How
much money would you save if you bought
18 quarts of milk at the sale?



A. 


$1 


.10 


92 




B . 


$ 


.90... 


61 


C. 


$ 


.95 


91 .... 


.... 99 


D. 


. $ 


.90 


63 .... 


65 


E. 


$ 


.os 


60 .... 


69 






Omit. . . . 


95 .... 


42 








93 .... 


50 
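
The ambiguity in item 2 may be seen by working the arithmetic under two readings of the stem. The sketch below takes the regular price as 20 cents a quart and the sale price as 70 cents a gallon, with 4 quarts to the gallon. The two readings (sale price applied pro rata versus whole gallons only) are an interpretation offered for illustration, not one stated in the report, and the variable names are hypothetical.

```python
QUARTS_PER_GALLON = 4
REGULAR_CENTS_PER_QUART = 20
SALE_CENTS_PER_GALLON = 70
QUARTS_BOUGHT = 18

# Cost of 18 quarts at the regular price, in cents.
regular_cost = QUARTS_BOUGHT * REGULAR_CENTS_PER_QUART

# Reading 1: the sale price applies pro rata to every quart.
sale_cost_pro_rata = QUARTS_BOUGHT * SALE_CENTS_PER_GALLON / QUARTS_PER_GALLON
saving_pro_rata = regular_cost - sale_cost_pro_rata

# Reading 2: only whole gallons get the sale price; the leftover
# quarts are bought at the regular price.
gallons, extra_quarts = divmod(QUARTS_BOUGHT, QUARTS_PER_GALLON)
sale_cost_whole = (gallons * SALE_CENTS_PER_GALLON
                   + extra_quarts * REGULAR_CENTS_PER_QUART)
saving_whole = regular_cost - sale_cost_whole

print(f"pro rata saving:      {saving_pro_rata / 100:.2f} dollars")
print(f"whole-gallon saving:  {saving_whole / 100:.2f} dollars")
```

The two readings give savings of 45 cents and 40 cents, amounts that appear to correspond to two of the listed choices, so more than one choice could be defended.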

                                                    Weights
Items                                         Sample I   Sample II

C. Seriously faulty
   (inadequate keyed choice)

3. In a certain state, $1,000 of a man's
   income is not taxed. All of his
   income over $1,000 is taxed at 20 percent,
   and all over $2,000 is taxed
   4 percent additional. His state
   income tax is $500. If you let X
   equal the amount of his income over
   $2,000, which one of the following
   equations is true?

     A. .20X + .04(X + 1000) = 500               50         52
     B. .20X + 1000 + .04X = 500                 58         54
     C. .20X + .04(X - 1000) = 500               57         63
     D. .20(X - 1000) + .04X = 500              [?]2        70
     E. .30(1000) + .04X = 500                   72         58
     Omit                                        57         62


APPENDIX B



ITEM-QUALITY CHECK LIST



DIRECTIONS FOR JUDGES: On the following pages you will find arithmetic
reasoning items intended for use with college sophomores.
Below each item is a list of nine common faults in multiple-choice
items.

If you think that a particular fault is present in an item, estimate
how detrimental it will be to the item's ability to discriminate
between those who know and those who do not know the point being
tested. If you think the fault will not be detrimental, place a
check beside "Not detrimental"; if you think that it will be
moderately detrimental, place a check beside "Moderately detrimental";
and if you think it will be seriously detrimental, place a check
beside "Seriously detrimental".

For each fault you find, specify in the space provided the part of
the item that is faulty and why you think it is faulty.

An answer key for the items is enclosed on a separate sheet.



ITEM: A 21. Milk which sells for 20 cents a quart
            is on sale for 70 cents a gallon. How
            much money could you save if you bought
            18 quarts of milk at the sale?

            A  $1.10
            B  $ .90
            C  $ .45
            D  $ .40
            E  $ .05



1. Inadequate keyed choice.

Not detrimental

Moderately detrimental

Seriously detrimental

Explanation:



2. Distracters that can be defended as adequately correct due to
ambiguity in expressing the meaning of the stem and choices.

Not detrimental

Moderately detrimental

Seriously detrimental

Explanation:



3. Distracters that can be defended as adequately correct even
though stem and choices are unambiguous.

Not detrimental

Moderately detrimental

Seriously detrimental

Explanation:



4. Ambiguity caused by the use of a negative or double negatives.

Not detrimental

Moderately detrimental

Seriously detrimental

Explanation:



5. Implausible distracters due to a lack of homogeneity with each
other and with keyed choice.

Not detrimental

Moderately detrimental

Seriously detrimental

Explanation:



6. Unattractive distracters, even though all choices are relatively
homogeneous.

Not detrimental

Moderately detrimental

Seriously detrimental

Explanation:



7. Long or precisely worded keyed choice.

Not detrimental

Moderately detrimental

Seriously detrimental

Explanation:



8. Logically overlapping distracters.

Not detrimental

Moderately detrimental

Seriously detrimental

Explanation:



9. Lack of grammatical agreement of stem with choices.

Not detrimental

Moderately detrimental

Seriously detrimental

Explanation:



